The provided cardiovascular dataset is available for academic purposes and is aimed at classifying whether a patient has cardiovascular disease. Stated differently, this data was collected to build and train models that use a variety of predictors to classify a patient as having or not having cardiovascular disease. The outcome can therefore be framed as a classification problem with a categorical response variable: for example, determining whether smoking is a predictor of cardiovascular disease and using accuracy statistics to measure its effectiveness as a predictor. If a model correctly predicts the presence of cardiovascular disease in a given patient 70% of the time, early, noninvasive, and safe interventions (such as diet and exercise) may replace the surgical, higher-risk interventions that become necessary as an undiagnosed condition worsens. A low false-negative rate, that is, rarely classifying a patient as disease-free when disease is present, may be a particularly useful measure of having mined valuable knowledge. Given the categorical response variable, the effectiveness of a prediction algorithm may be measured through accuracy or precision statistics, ROC curves, or cross-validation ("CV"), for example a 10-fold CV.
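As a rough sketch of the evaluation approach described above, the following runs a 10-fold CV with a baseline logistic regression using scikit-learn. The data here is synthetic and purely illustrative, standing in for the cardiovascular predictors and target.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 "patients", 5 numeric predictors,
# and a binary target loosely driven by the first predictor.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# 10-fold cross-validated accuracy for a baseline classifier
scores = cross_val_score(LogisticRegression(), X, y, cv=10, scoring='accuracy')
print(scores.mean())
```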
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.ticker import MaxNLocator
import seaborn as sns
import plotly.express as px
%matplotlib inline
df = pd.read_csv('../data/cardio_train.csv', delimiter=';')
The cardiovascular dataset consists of 11 features and 1 target variable. Attribute descriptions can be found on Kaggle. They are provided below.
There are 3 types of input features in the data set.
Table 1: Cardiovascular Dataset - Attribute Descriptions
| Column Description | Feature Type | Column Name | Data Type |
|---|---|---|---|
| Age | Objective | age | int (days) |
| Height | Objective | height | int (cm) |
| Weight | Objective | weight | float (kg) |
| Gender | Objective | gender | 1: female, 2: male |
| Systolic blood pressure | Examination | ap_hi | int |
| Diastolic blood pressure | Examination | ap_lo | int |
| Cholesterol | Examination | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination | gluc | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective | smoke | binary |
| Alcohol intake | Subjective | alco | binary |
| Physical activity | Subjective | active | binary |
| Has CVD? | Target | cardio | binary |
df.head()
We decided to use the id column as the index as it appears to be a unique identifier for the subject.
df.set_index("id", inplace=True)
df.index.is_unique
df.shape
Our data set has 70 thousand rows and 12 columns, which should work nicely for many of the machine learning classifiers we may attempt to utilize.
cols = df.columns
Store columns for later use.
cols
df.isna().any()
There are no missing values in any of the columns of the raw data. But upon further inspection we found that zeros or other placeholder values may have been used in place of missing entries, which should also be addressed.
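A quick way to surface such placeholder zeros is to count them per column. The sketch below uses a small made-up frame, but the same expression applies to our df for any columns where zero is not a plausible value.

```python
import pandas as pd

# Made-up frame standing in for the real data: a height of 0 cm or a
# weight of 0 kg is physically impossible and likely marks a missing entry.
df_demo = pd.DataFrame({'height': [165, 0, 172], 'weight': [70.0, 80.5, 0.0]})

# Count suspicious zeros per column
zero_counts = (df_demo[['height', 'weight']] == 0).sum()
print(zero_counts)
```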
df.duplicated().any()
There do, however, appear to be duplicated rows.
df.duplicated().sum()
There were a total of 24 duplicated observations, where every column of a record matches at least one other record in the set. Moving id out of the columns and into the index was important; otherwise these duplicated entries would have been more difficult to detect.
duplicated = df[df.duplicated(keep=False)].sort_values(by=['age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo', 'cholesterol',
'gluc', 'smoke', 'alco', 'active', 'cardio'])
duplicated.head(10)
df_clean = df.copy(deep=True)
df_clean.drop_duplicates(inplace=True)
We'll remove the duplicates entirely, as this should not affect the ability of our models to make predictions given the number of observations at our disposal.
# %%time
# df_clean['age'] = df_clean['age'].apply(lambda x: round(x / 365))
df_clean['age'] = (df_clean['age'] / 365).round().astype('int')
Age was provided in days, and for the sake of interpretability we'll be converting this to years for all observations.
plt.style.use('ggplot')
fig_1 = plt.figure(1, figsize=(20, 5))
chart_1 = fig_1.add_subplot(121)
chart_2 = fig_1.add_subplot(122)
chart_1.hist(df_clean["age"])
chart_1.xaxis.set_major_locator(MaxNLocator(integer=True))
chart_1.set_title('Histogram of Age')
chart_1.set_xlabel('Age (in years)')
chart_1.set_ylabel('Frequency')
sns.boxplot(x="age", data=df_clean, ax=chart_2)
chart_2.xaxis.set_major_locator(MaxNLocator(integer=True))
chart_2.set_title('Boxplot of Age')
chart_2.set_xlabel('Age (in years)')
plt.show()
Age has relatively few outliers but is slightly right-skewed.
fig_2 = plt.figure(1, figsize=(20, 5))
chart_1 = fig_2.add_subplot(121)
chart_2 = fig_2.add_subplot(122)
chart_1.hist(df_clean["height"])
chart_1.xaxis.set_major_locator(MaxNLocator(integer=True))
chart_1.set_title('Histogram of Height')
chart_1.set_xlabel('Height (in cm)')
chart_1.set_ylabel('Frequency')
sns.boxplot(x="height", data=df_clean, ax=chart_2)
chart_2.xaxis.set_major_locator(MaxNLocator(integer=True))
chart_2.set_title('Boxplot of Height')
chart_2.set_xlabel('Height (in cm)')
plt.show()
There are quite a few outliers in the height column that should be addressed. The largest, at 250 cm (over 8 feet), appears to be an error.
fig_3 = plt.figure(1, figsize=(20, 5))
chart_1 = fig_3.add_subplot(121)
chart_2 = fig_3.add_subplot(122)
chart_1.hist(df_clean["weight"])
chart_1.xaxis.set_major_locator(MaxNLocator(integer=True))
chart_1.set_title('Histogram of Weight')
chart_1.set_xlabel('Weight (in kg)')
chart_1.set_ylabel('Frequency')
sns.boxplot(x="weight", data=df_clean, ax=chart_2)
chart_2.xaxis.set_major_locator(MaxNLocator(integer=True))
chart_2.set_title('Boxplot of Weight')
chart_2.set_xlabel('Weight (in kg)')
plt.show()
Similarly, there are many outliers in the weight column.
fig_4 = plt.figure(1, figsize=(20, 5))
chart_1 = fig_4.add_subplot(121)
chart_2 = fig_4.add_subplot(122)
chart_1.hist(df_clean["ap_hi"])
chart_1.xaxis.set_major_locator(MaxNLocator(integer=True))
chart_1.set_title('Histogram of Systolic blood pressure')
chart_1.set_xlabel('ap_hi')
chart_1.set_ylabel('Frequency')
sns.boxplot(x="ap_hi", data=df_clean, ax=chart_2)
chart_2.xaxis.set_major_locator(MaxNLocator(integer=True))
chart_2.set_title('Boxplot of Systolic blood pressure')
chart_2.set_xlabel('ap_hi')
plt.show()
The distribution of systolic blood pressure is quite unusual, with several readings that are likely erroneous.
df_clean["ap_hi"].sample(10)
A random sample shows values within the expected range.
df_clean["ap_hi"].sort_values()
But there were negative values and extremely high ones that should be reviewed. We'll address these outliers later within the imputation section.
# df_clean = df_clean[~(df_clean['ap_hi'] < 40) & (df_clean['ap_hi'] < 300)]
# df_clean.shape[0]
fig_5 = plt.figure(1, figsize=(20, 5))
chart_1 = fig_5.add_subplot(121)
chart_2 = fig_5.add_subplot(122)
chart_1.hist(df_clean["ap_lo"])
chart_1.xaxis.set_major_locator(MaxNLocator(integer=True))
chart_1.set_title('Histogram of Diastolic blood pressure')
chart_1.set_xlabel('ap_lo')
chart_1.set_ylabel('Frequency')
sns.boxplot(x="ap_lo", data=df_clean, ax=chart_2)
chart_2.xaxis.set_major_locator(MaxNLocator(integer=True))
chart_2.set_title('Boxplot of Diastolic blood pressure')
chart_2.set_xlabel('ap_lo')
plt.show()
df_clean["ap_lo"].sample(10)
df_clean["ap_lo"].sort_values()
The same technique should be applied to the ap_lo feature.
fig_0 = plt.figure(1, figsize=(20, 5))
chart_1 = fig_0.add_subplot(121)
chart_2 = fig_0.add_subplot(122)
sns.countplot(x="cardio", hue="cardio", data=df_clean, ax=chart_1)
# chart_1.legend(bbox_to_anchor=(1,1), title='CVD')
chart_1.set_title('Cardiovascular Disease')
chart_1.set_xlabel('CVD')
chart_1.set_ylabel('Count')
pd.crosstab(df_clean["cholesterol"], df_clean["cardio"]).plot(kind="bar", ax=chart_2)
chart_2.set_title('CVD by Cholesterol Level')
chart_2.set_xlabel('Cholesterol Level')
chart_2.set_ylabel('Count')
plt.show()
The Cardiovascular Disease (CVD) response variable is evenly balanced. The presence or absence of CVD does seem to change with cholesterol level.
fig_6 = plt.figure(1, figsize=(20, 5))
chart_1 = fig_6.add_subplot(121)
chart_2 = fig_6.add_subplot(122)
pd.crosstab(df_clean["gender"], df_clean["cardio"]).plot(kind="bar", ax=chart_1)
chart_1.set_title('CVD by Gender')
chart_1.set_xlabel('Gender')
chart_1.set_ylabel('Count')
df_clean.groupby(['gender', 'height']).sum().reset_index()
df_clean.groupby(['gender', 'weight']).mean().reset_index().groupby('gender')['weight'].mean().plot(kind='bar', ax=chart_2)
chart_2.set_title('Average Weight by Gender')
chart_2.set_xlabel('Gender')
chart_2.set_ylabel('Weight')
plt.show()
There are more subjects with gender label 1 in the study than with label 2.
We're going to assume that label 2 is male, as the mean weight is slightly higher for that category.
fig_7 = plt.figure(1, figsize=(20, 5))
chart_1 = fig_7.add_subplot(121)
chart_2 = fig_7.add_subplot(122)
pd.crosstab(df_clean["gluc"], df_clean["cardio"]).plot(kind="bar", ax=chart_1)
chart_1.set_title('CVD by Glucose Level')
chart_1.set_xlabel('Glucose Level')
chart_1.set_ylabel('Count')
pd.crosstab(df_clean["smoke"], df_clean["cardio"]).plot(kind="bar", ax=chart_2)
chart_2.set_title('CVD by Smoking Level')
chart_2.set_xlabel('Smoking Level')
chart_2.set_ylabel('Count')
plt.show()
The presence or absence of CVD also changes with glucose level but, surprisingly, not with smoking.
df_clean.groupby(['gluc', 'cardio']).size().unstack(fill_value=0)
There are only a few thousand entries within levels 2 and 3 of the glucose column.
fig_8 = plt.figure(1, figsize=(20, 5))
chart_1 = fig_8.add_subplot(121)
chart_2 = fig_8.add_subplot(122)
pd.crosstab(df_clean["alco"], df_clean["cardio"]).plot(kind="bar", ax=chart_1)
chart_1.set_title('CVD by Alcohol Level')
chart_1.set_xlabel('Alcohol Level')
chart_1.set_ylabel('Count')
pd.crosstab(df_clean["active"], df_clean["cardio"]).plot(kind="bar", ax=chart_2)
chart_2.set_title('CVD by Activity Level')
chart_2.set_xlabel('Activity Level')
chart_2.set_ylabel('Count')
plt.show()
Also surprising is that alcohol intake didn't seem to have an impact on the response variable. Activity level did show good separation.
median_age = df_clean['age'].median()
age_outlier_ids = df_clean.index[(np.abs(df_clean['age'] - df_clean['age'].mean()) > (3 * df_clean['age'].std()))]
df_clean.loc[df_clean.index.isin(age_outlier_ids), "age"] = median_age
df_clean.loc[df_clean.index.isin(age_outlier_ids)].head()
We've imputed the median value for observations with an age more than three standard deviations from the mean (~4 observations in total). These may well have been valid observations, but we wanted our model to generalize well to new and unseen data sets.
median_height = df_clean['height'].median()
height_outlier_ids = df_clean.index[(np.abs(df_clean['height'] - df_clean['height'].mean()) > (3 * df_clean['height'].std()))]
df_clean.loc[df_clean.index.isin(height_outlier_ids), "height"] = median_height
df_clean.loc[df_clean.index.isin(height_outlier_ids)].head()
We've imputed observations with a height more than three standard deviations from the mean with the median value. (~287 observations in total)
median_weight = df_clean['weight'].median()
weight_outlier_ids = df_clean.index[(np.abs(df_clean['weight'] - df_clean['weight'].mean()) > (3 * df_clean['weight'].std()))]
df_clean.loc[df_clean.index.isin(weight_outlier_ids), "weight"] = median_weight
df_clean.loc[df_clean.index.isin(weight_outlier_ids)].head()
The same method was applied to the 702 weight outliers.
median_ap_hi = df_clean['ap_hi'].median()
ap_hi_outlier_ids = df_clean.index[(np.abs(df_clean['ap_hi'] - df_clean['ap_hi'].mean()) > (3 * df_clean['ap_hi'].std()))]
df_clean.loc[df_clean.index.isin(ap_hi_outlier_ids), "ap_hi"] = median_ap_hi
df_clean.loc[df_clean.index.isin(ap_hi_outlier_ids)].head()
ap_hi_outlier_ids2 = df_clean.index[(df_clean['ap_hi'] < 40) | (df_clean['ap_hi'] > 300)]
df_clean.loc[df_clean.index.isin(ap_hi_outlier_ids2), "ap_hi"] = median_ap_hi
Even after handling the ap_hi outliers through the standard-deviation method, there were still some unusual entries (readings less than 40 or greater than 300) that were addressed manually.
median_ap_lo = df_clean['ap_lo'].median()
ap_lo_outlier_ids = df_clean.index[(np.abs(df_clean['ap_lo'] - df_clean['ap_lo'].mean()) > (3 * df_clean['ap_lo'].std()))]
df_clean.loc[df_clean.index.isin(ap_lo_outlier_ids), "ap_lo"] = median_ap_lo
df_clean.loc[df_clean.index.isin(ap_lo_outlier_ids)].head()
ap_lo_outlier_ids2 = df_clean.index[(df_clean['ap_lo'] < 40) | (df_clean['ap_lo'] > 300)]
df_clean.loc[df_clean.index.isin(ap_lo_outlier_ids2), "ap_lo"] = median_ap_lo
The ap_lo feature needed similar processing.
df_clean.describe()
The table above gives an overview of all of the attributes included in the dataset. It shows our counts after cleaning and provides data from which we can hope to draw inferences. Below we will look at the simplest level of some of our attributes to review whether some general assumptions hold within our data.
quant_cols = ['age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo']
df_clean.agg({col: ['min', 'max', 'median', 'skew', 'std'] for col in quant_cols})
This table narrows down our quantitative variables, giving us a quick look at the balance of the dataset. The skew statistic helps us see which way our observations lean, and none were too far outside preferred parameters. We noticed that our population skews slightly younger (toward the minimum age of 39), the gender variable skews toward 2 (male), height is almost symmetric, weight leans slightly toward our maximum, and blood pressure runs slightly high. As stated previously, with the cleaned data we feel comfortable drawing inferences and have decided to move forward.
df_clean.groupby("cholesterol").mean()
Cholesterol has been a topic of debate for years given advancements in technology. We felt that this attribute might give us some insight into the weight it carries in classifying cardiovascular disease. It encodes 1 as a normal level of cholesterol, 2 as above normal, and 3 as well above normal.
df_clean.groupby("smoke").mean()
As made clear by numerous medical studies, smoking can greatly affect health. In this attribute, 0 is a non-smoker and 1 is a smoker. We observed that height was greater in the smoking population (which could be attributed to more male smokers than female). Weight was slightly higher, which could correspond with the skewed height, though previous medical studies also suggest that smokers generally weigh more than non-smokers. Most expected of all, both systolic and diastolic blood pressure were higher among smokers than non-smokers.
df_clean.groupby("active").mean()
Lastly, we wanted to evaluate the physical activity of our population. This attribute's levels are 0 for non-active and 1 for active. What was curious is that age is skewed toward neither older nor younger subjects; it leans only ever so slightly toward younger persons. Generally, physical activity is not determined by age in this dataset, whereas we would normally expect a population to become less active as it ages. Weight was another variable where we expected to see a difference between active and non-active subjects. Here we did find this to be true, as those who are active were approximately half a kilogram lighter than the non-active. This wasn't as great a difference as we would have expected, but it does support our general assumption. Additionally, we would really expect systolic and diastolic blood pressure to be affected by this variable; however, there was notably almost no change here as well. From this table we'd draw the conclusion that physical activity does not necessarily help prevent or predict cardiovascular disease, though much more investigation is required to prove or disprove this entirely.
fig = px.scatter_3d(df_clean.sample(200), x='ap_hi', y='ap_lo', z='age',
color='cardio')
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
fig.show()
import base64, io, IPython
from PIL import Image as PILImage
image = PILImage.open("../img/3d_scatterplot.png")
output = io.BytesIO()
image.save(output, format='PNG')
encoded_string = base64.b64encode(output.getvalue()).decode()
html = '<img src="data:image/png;base64,{}"/>'.format(encoded_string)
IPython.display.HTML(html)
## PDF OUTPUT ONLY
# <img src = "../img/3d_scatterplot.png" >
The 3D scatterplot shows higher CVD rates at higher ap_hi and ap_lo values, but no clear trend with age.
df_sub3 = df_clean[['age', 'weight', 'cholesterol', 'ap_hi', 'cardio', 'ap_lo', 'smoke', 'active']].copy()
df_sub3['cardio'] = df_sub3['cardio'] == 1  # cardio is an int column, so compare to the integer 1
#Adapted from https://towardsdatascience.com/the-art-of-effective-visualization-of-multi-dimensional-data-6c7202990c57
ax = sns.kdeplot(data=df_sub3, x='age', y='weight',
                 cmap="YlOrBr", fill=True, thresh=0.05)
The above kernel density plot depicts the density of those with cardiovascular disease by age and weight. As can be seen, the densest areas occur at ages of 54-56 and a weight of about 70. This may be indicative of ages and weights that are predictive of cardiovascular disease.
We can create a feature correlation heatmap using Seaborn.
features = ['age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo', 'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio']
plt.figure(figsize=(15,15))
# Use an easier to see colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Mask
correlation = df_clean[features].corr()
correlation[np.abs(correlation)<.2] = 0
sns.heatmap(correlation, annot = True, cmap=cmap).set(title = 'Correlation Heatmap')
plt.show()
The above heatmap shows some interesting correlations, namely between glucose and cholesterol, alcohol and smoking, and gender and smoking.
Additionally, we see that several features are correlated with our target variable, 'cardio'.
Let's cross tabulate a few of these and inspect further.
fig = plt.figure(1, figsize=(20, 5))
chart_1 = fig.add_subplot(131)
chart_2 = fig.add_subplot(132)
chart_3 = fig.add_subplot(133)
pd.crosstab(df_clean["gluc"], df_clean["cholesterol"]).apply(lambda r: r/r.sum(), axis=1).plot(kind="bar", ax=chart_1)
chart_1.set_title('Glucose vs Cholesterol')
chart_1.set_xlabel('Glucose')
chart_1.set_ylabel('Percentage')
pd.crosstab(df_clean["alco"], df_clean["smoke"]).apply(lambda r: r/r.sum(), axis=1).plot(kind="bar", ax=chart_2)
chart_2.set_title('Alcohol vs Smoking')
chart_2.set_xlabel('Alcohol')
chart_2.set_ylabel('Percentage')
pd.crosstab(df_clean["gender"], df_clean["smoke"]).apply(lambda r: r/r.sum(), axis=1).plot(kind="bar", ax=chart_3)
chart_3.set_title('Gender vs Smoking')
chart_3.set_xlabel('Gender')
chart_3.set_ylabel('Percentage')
plt.show()
Some observations based on the above charts:
#Scatterplot Matrix
cmap = sns.diverging_palette(220, 10, as_cmap=True)
df_clean_jitter = df_clean[['cardio', 'ap_hi', 'ap_lo', 'cholesterol', 'weight', 'age']]
df_clean_jitter[['cholesterol', 'age', 'weight']] = df_clean_jitter[['cholesterol', 'age', 'weight']].values + np.random.rand(len(df_clean_jitter),3)/2
sns.pairplot(df_clean_jitter, hue="cardio", height=2)
The above pairwise scatter plot depicts, along the center diagonal, the distributions for those who had cardiovascular disease and those who did not for each attribute. As shown in those distributions, there are apparent differences between the two groups. For example, as will be detailed below, the distributions show that as blood pressure, cholesterol, weight, and age increase, the number of those with cardiovascular disease also increases. In addition, the pairwise scatter plot shows the scatter distributions between each pair of attributes. Some separation between those who had cardiovascular disease and those who did not can be seen in the plots of weight versus age, cholesterol versus weight, and age versus blood pressure, despite some values being seemingly randomly distributed. These attributes may therefore be indicative of cardiovascular disease.
f, ax = plt.subplots(figsize=(10, 10))
sns.violinplot(x="cholesterol", y="weight", hue="cardio", data=df_clean,
split=True, inner="quart")
f, ax = plt.subplots(figsize=(10, 10))
sns.violinplot(x="cholesterol", y="age", hue="cardio", data=df_clean,
split=True, inner="quart")
The above violin plots show the relationship between cholesterol and weight for the groups having cardiovascular disease (1) and not having cardiovascular disease (0). They show that as the cholesterol level increases from 1 to 3, the median difference in weight between the two groups also increases. It is also apparent that for each group, the median weight at cholesterol levels 2-3 is higher than at level 1. This is consistent with the understanding that as cholesterol levels increase, weight also increases.
The violin plots depicting cholesterol and age show that the distributions for those with and without cardiovascular disease are consistent across cholesterol levels 2 and 3. For example, the difference between the median ages of the two groups is approximately equal at levels 2 and 3. This is consistent with cholesterol levels typically increasing with age as activities such as exercise decrease. However, the distributions at cholesterol level 1 suggest that age is a potential indicator of cardiovascular disease: the median age of those with cardiovascular disease at level 1 is approximately 56, potentially significantly greater than the median age of 52 for those without.
f, ax = plt.subplots(figsize=(10, 10))
sns.violinplot(x="cholesterol", y="ap_hi", hue="cardio", data=df_clean,
split=True, inner="quart")
f, ax = plt.subplots(figsize=(10, 10))
sns.violinplot(x="cholesterol", y="ap_lo", hue="cardio", data=df_clean,
split=True, inner="quart")
The violin plots of blood pressure and cholesterol for those with and without cardiovascular disease appear to include multiple peaks. Otherwise, it appears that those with lower overall blood pressure, both systolic and diastolic, suffered less cardiovascular disease. As both readings increase, the distribution of those with cardiovascular disease also grows. Blood pressure therefore appears to be an indicator of cardiovascular disease.
df_sub2 = df_clean[['age', 'weight', 'cholesterol', 'ap_hi', 'cardio', 'ap_lo']]
#Normalizing the data
df_normalized2 = (df_sub2-df_sub2.mean())/(df_sub2.std())
df_normalized2.cardio = df_sub2.cardio
df_normalized2.cholesterol = df_normalized2.cholesterol+np.random.rand(*df_normalized2.cholesterol.shape)/2
df_normalized2.ap_hi = df_normalized2.ap_hi+np.random.rand(*df_normalized2.ap_hi.shape)/2
fig = px.parallel_coordinates(df_sub2, color='cardio', labels={"age": "Age", "weight": "Weight", "cholesterol": "Cholesterol", "ap_hi": "High Blood Pressure Value",
"cardio": "Cardiovascular Disease", "ap_lo": "Low Blood Pressure Value"},
color_continuous_scale=px.colors.diverging.Tealrose,
color_continuous_midpoint=1)
fig.show()
The above parallel coordinates plot shows the interconnections between those who had cardiovascular disease and those who did not, across the attributes that appear to be indicators of the disease. Here, those with cardiovascular disease are shown in beige and those without are shown in teal. Separation can be seen between age and weight around 43, suggesting that age and weight may interact in predicting cardiovascular disease. There is also some separation among weight, cholesterol, and high blood pressure, indicating that as these values increase, cardiovascular disease also increases. Lastly, cardiovascular disease becomes more prevalent as all of the attributes increase.
from pandas.plotting import parallel_coordinates
#df_sub2 = df_clean[['age', 'weight', 'cholesterol', 'ap_hi', 'cardio', 'ap_lo']]
df_sub = df_clean[['age', 'weight', 'cholesterol', 'ap_hi', 'cardio', 'ap_lo']].copy()
df_sub['cardio'] = df_sub['cardio'] == 1  # cardio is an int column, so compare to the integer 1
#normalizing values
df_normalized = (df_sub-df_sub.mean())/(df_sub.std())
df_normalized.cardio = df_sub.cardio
df_normalized.cholesterol = df_normalized.cholesterol+np.random.rand(*df_normalized.cholesterol.shape)/2
df_normalized.ap_hi = df_normalized.ap_hi+np.random.rand(*df_normalized.ap_hi.shape)/2
parallel_coordinates(df_normalized,'cardio')
plt.show()
The above parallel coordinates plot shows the axes of the attributes for the group with cardiovascular disease. Eliminating the group without cardiovascular disease reveals more detail after normalization. For example, there appear to be age and weight bands that may be indicative of cardiovascular disease, such that falling outside these bands is indicative of not having the disease. Lastly, while most of the lines are clustered between -2 and 2, more lines extend from the high values of systolic and diastolic blood pressure. This indicates an increase in cardiovascular disease as blood pressure values increase, which fits the domain knowledge of the problem.
plt.figure(figsize=(10,10))
df_clean['bmi'] = df_clean['weight'] / (df_clean['height']/100)**2
sns.boxplot(x='cardio',
y='bmi',
data=df_clean,
palette="Set1").set(title = 'Cardiovascular Disease and BMI',
xlabel = 'Presence of Cardiovascular Disease?',
ylabel = 'BMI')
plt.show()
For entertainment, we can use some of what we learned in stats so we don't forget, and run a t-test between the diseased and healthy groups. It shows what we would expect: there is statistical significance in the mean difference of BMI between the two groups (healthy vs. diseased).
from scipy import stats
import researchpy as rp
# Let's create 2 sets, one for disease, and another for healthy
disease = df_clean[df_clean['cardio'] == 1]
disease.reset_index(inplace = True)
healthy = df_clean[df_clean['cardio'] == 0]
healthy.reset_index(inplace = True)
var='bmi'
# diff = disease[var] - healthy[var]
# stats.probplot(diff, plot= plt)
# plt.title('BMI P-P Plot')
# stats.ttest_ind(disease[var], healthy[var]))
descriptives, results = rp.ttest(disease[var], healthy[var], equal_variances=False)
results
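For reference, the same Welch's (unequal-variances) t-test can be run with scipy alone. The BMI samples below are made up and only stand in for the diseased and healthy groups.

```python
import numpy as np
from scipy import stats

# Made-up BMI samples standing in for the two groups; the slightly higher
# mean for the diseased group mirrors what we observed in the data.
rng = np.random.default_rng(2)
bmi_disease = rng.normal(loc=28.0, scale=4.0, size=500)
bmi_healthy = rng.normal(loc=26.5, scale=4.0, size=500)

# Welch's t-test: equal_var=False matches the researchpy call above
t_stat, p_value = stats.ttest_ind(bmi_disease, bmi_healthy, equal_var=False)
print(t_stat, p_value)
```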
We could also create a BMI category to represent the following four cases. Source
As expected, we can see that a higher BMI group correlates with a higher chance of being diagnosed with cardiovascular disease.
df_clean['bmiGrp'] = np.where((df_clean.bmi < 18.5), 1, 0)
df_clean['bmiGrp'] = np.where((df_clean.bmi >= 18.5) & (df_clean.bmi < 25), 2, df_clean.bmiGrp)
df_clean['bmiGrp'] = np.where((df_clean.bmi >= 25) & (df_clean.bmi < 30), 3, df_clean.bmiGrp)
df_clean['bmiGrp'] = np.where((df_clean.bmi >= 30), 4, df_clean.bmiGrp)
df_grouped = df_clean.groupby(by=['bmiGrp'])
print("Percentage of Cardiovascular Disease in each BMI group:")
print (df_grouped.cardio.sum() / df_grouped.cardio.count() *100)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,8))
sns.boxplot(x='cardio', y='bmiGrp', data=df_clean, ax=ax1)
#sns.barplot(x='bp', y='cardio', data=df, ax=ax2)
cardio = pd.crosstab([df_clean['bmiGrp']],df_clean.cardio.astype(bool))
cardio_rate = cardio.div(cardio.sum(1).astype(float), axis=0)  # normalize each row
cardio_rate.plot(kind='barh', stacked=True, ax=ax2)
plt.show()
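As a side note, the same four BMI groups can be assigned in a single pd.cut call rather than a chain of np.where statements. The sketch below uses made-up BMI values, with right-open bins matching the conditions above.

```python
import pandas as pd

# Made-up BMI values, one per group: underweight, normal, overweight, obese
bmi = pd.Series([17.0, 22.0, 27.0, 35.0])

# right=False makes the bins [0, 18.5), [18.5, 25), [25, 30), [30, inf),
# which matches the np.where conditions used above
groups = pd.cut(bmi, bins=[0, 18.5, 25, 30, float('inf')],
                labels=[1, 2, 3, 4], right=False).astype(int)
print(groups.tolist())  # → [1, 2, 3, 4]
```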
# Create blood pressure categories
df_clean['bp'] = np.where((df_clean.ap_hi < 120) & (df_clean.ap_lo < 80), 1, 0)
df_clean['bp'] = np.where((df_clean.ap_hi >= 120) & (df_clean.ap_hi < 130) & (df_clean.ap_lo < 80), 2, df_clean.bp)
df_clean['bp'] = np.where((df_clean.ap_hi >= 130) & (df_clean.ap_hi < 140) | ((df_clean.ap_lo >= 80) & (df_clean.ap_lo < 90)), 3, df_clean.bp)
df_clean['bp'] = np.where((df_clean.ap_hi >= 140) | (df_clean.ap_lo >= 90), 4, df_clean.bp)
df_clean['bp'] = np.where((df_clean.ap_hi > 180) | (df_clean.ap_lo > 120), 5, df_clean.bp)
df_grouped = df_clean.groupby(by=['bp'])
print("Percentage of Cardiovascular Disease in each Blood Pressure group:")
print (df_grouped.cardio.sum() / df_grouped.cardio.count() *100)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,8))
sns.boxplot(x='cardio', y='bp', data=df_clean, ax=ax1)
#sns.barplot(x='bp', y='cardio', data=df, ax=ax2)
cardio = pd.crosstab([df_clean['bp']],df_clean.cardio.astype(bool))
cardio_rate = cardio.div(cardio.sum(1).astype(float), axis=0)  # normalize each row
cardio_rate.plot(kind='barh', stacked=True, ax=ax2)
plt.show()
Let's create a new feature correlation heatmap, but this time using the new features we created. Here we can see a stronger correlation between our BMI and BP groups and our response variable (cardio). Note that, as before, we mask correlations with an absolute value below .2, setting them to 0 for a better visualization.
features = ['age', 'gender', 'bmiGrp', 'bp', 'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio']
plt.figure(figsize=(15,15))
# Use an easier to see colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
correlation = df_clean[features].corr()
correlation[np.abs(correlation)<.2] = 0
sns.heatmap(correlation, annot = True, cmap=cmap).set(title = 'Correlation Heatmap')
plt.show()
# Full Model
X_cols = ['age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo', 'cholesterol', 'gluc', 'smoke', 'alco', 'active']
# *dcrouthamel - test with new features
# X_cols = ['age', 'gender', 'bmiGrp', 'bp', 'cholesterol', 'gluc', 'smoke', 'alco', 'active']
X = df_clean[X_cols].to_numpy()
type(X)
y = df_clean['cardio'].to_numpy()
Full Model consisting of all features with standardized values.
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
# logreg = LogisticRegression()
# logreg.fit(X_train, y_train)
logreg = LogisticRegression()
logreg.fit(X_train_std, y_train)
y_pred = logreg.predict(X_test_std)
print('Accuracy of the log reg model on the test data: {:.2f}'.format(logreg.score(X_test_std, y_test)))
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)  # use a new name to avoid shadowing the imported function
print(cm)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
from IPython.display import Markdown as md
### Logistic Regression Metrics
md(f"**True Positives:** {cm[1, 1]} \n\n **True Negatives:** {cm[0, 0]} \n\n **False Positives:** {cm[0, 1]} \n\n **False Negatives:** {cm[1, 0]}")
md(f"**Accuracy:** {(cm[1, 1] + cm[0, 0]) / cm.sum():.3f}\n\n-how often we were correct overall")
md(f"**Error:** {(cm[0, 1] + cm[1, 0]) / cm.sum():.3f}\n\n-how often we were incorrect overall")
md(f"**Sensitivity/Recall:** {cm[1, 1] / cm[1].sum():.3f}\n\n-when the patient actually had CVD, how often we were correct")
md(f"**Specificity:** {cm[0, 0] / cm[0].sum():.3f}\n\n-when the patient did not have CVD, how often we were correct")
md(f"**False Positive Rate:** {cm[0, 1] / (cm[0, 0] + cm[0, 1]):.3f}\n\n-when the patient did not have CVD, how often we were incorrect")
md(f"**Precision:** {cm[1, 1] / (cm[1, 1] + cm[0, 1]):.3f}\n\n-how precise we were when classifying the patient as having CVD")
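As a sanity check, the hand-computed ratios above agree with scikit-learn's built-in scorers. A minimal, self-contained sketch on made-up labels (`y_true` and `y_hat` are illustrative, not from the dataset):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # illustrative ground truth
y_hat  = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # illustrative predictions

cm_demo = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm_demo.ravel()  # sklearn lays the matrix out as [[tn, fp], [fn, tp]]

# the confusion-matrix ratios match the dedicated scorers
assert accuracy_score(y_true, y_hat) == (tp + tn) / cm_demo.sum()
assert recall_score(y_true, y_hat) == tp / (tp + fn)
assert precision_score(y_true, y_hat) == tp / (tp + fp)
print(accuracy_score(y_true, y_hat), recall_score(y_true, y_hat), precision_score(y_true, y_hat))
```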
# plot_confusion_matrix/plot_roc_curve were removed in scikit-learn 1.2;
# the Display classes are the current equivalents
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay
fig = plt.figure(1, figsize=(20, 5))
chart_1 = fig.add_subplot(121)
chart_2 = fig.add_subplot(122)
ConfusionMatrixDisplay.from_estimator(logreg, X_test_std, y_test, normalize='true', ax=chart_1)
chart_1.set_title('Confusion Matrix')
RocCurveDisplay.from_estimator(logreg, X_test_std, y_test, ax=chart_2)
chart_2.set_title('ROC Curve')
plt.show()
Principal Component Analysis (PCA) is a data compression technique that can reduce the dimensionality of a data set. It does this by finding the directions of maximum variance in the higher-dimensional space and projecting the data onto a new space with fewer dimensions. Although our data set doesn't have a huge number of features, let's explore what PCA can do for us.
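To make the idea concrete, here is a minimal sketch on synthetic data (the names are illustrative): when two columns are strongly correlated, their shared variance collapses into one dominant principal component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_toy = rng.randn(200, 3)
X_toy[:, 1] = 2 * X_toy[:, 0] + 0.1 * rng.randn(200)  # column 1 nearly duplicates column 0

pca_toy = PCA(n_components=2)
X_red = pca_toy.fit_transform(X_toy)   # project 3-D data onto the 2 directions of max variance
print(X_red.shape)                     # (200, 2)
print(pca_toy.explained_variance_ratio_)
```

The first ratio dominates because the correlated pair contributes most of the total variance along a single direction.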
First, we'll define a function that can plot two principal components and a decision boundary. This code was taken from chapter 4 of Python Machine Learning, by Sebastian Raschka and Vahid Mirjalili.
# Chapter 4 of book
from matplotlib.colors import ListedColormap

def plot_decision_regions(X, y, classifier, resolution=0.02):
    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot examples by class
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0],
                    y=X[y == cl, 1],
                    alpha=0.6,
                    color=cmap(idx),
                    edgecolor='black',
                    marker=markers[idx],
                    label=cl)
Below, we'll create a PCA transformer with two components and fit a logistic regression model on the reduced data set.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
pca = PCA(n_components=2)
# sc = StandardScaler()
# X_train_std = sc.fit_transform(X_train)
# X_test_std = sc.transform(X_test)
# dimensionality reduction:
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
# fitting the logistic regression model on the reduced dataset:
logreg = LogisticRegression()
logreg.fit(X_train_pca, y_train)
plot_decision_regions(X_train_pca, y_train, classifier=logreg)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.tight_layout()
plt.show()
pca.explained_variance_ratio_
Above, we can see that the two components account for nearly 80 percent of the variance. The figure, though, doesn't show a good separation. This could be because the classes only separate in more than two dimensions, i.e., we would need to run PCA with a higher number of components to retain more of the variance.
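One way to act on that observation, sketched here on stand-in random data (`X_demo` is illustrative, not our training matrix): fit PCA with all components kept and inspect the cumulative explained variance to decide how many components are worth retaining.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_demo = rng.randn(300, 11)   # stand-in for an 11-feature training matrix

pca_full = PCA()              # n_components=None keeps all 11 components
pca_full.fit(X_demo)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

# smallest k whose leading components retain at least 95% of the variance
k = int(np.searchsorted(cum_var, 0.95)) + 1
print(cum_var)
print(k)
```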
Below we output some metrics for the Logistic Regression model fit on the reduced-dimensionality set. We see that the accuracy is comparable to the full model created previously. Accuracy is an acceptable metric for a balanced data set. However, in the case of medical diagnosis, Recall (Sensitivity) is an important metric. It describes the proportion of patients with CVD who are correctly diagnosed. If this number is low, patients won't be correctly identified and won't receive the treatment they should.
The PCA model shows .62 for recall, whereas the previous full model shows .67. Because of that, I would go with the full model.
y_pred = logreg.predict(X_test_pca)
print('Accuracy of the log reg model on the test data: {:.2f}'.format(logreg.score(X_test_pca, y_test)))
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
# plot_confusion_matrix/plot_roc_curve were removed in scikit-learn 1.2;
# the Display classes are the current equivalents
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay
fig = plt.figure(1, figsize=(20, 5))
chart_1 = fig.add_subplot(121)
chart_2 = fig.add_subplot(122)
ConfusionMatrixDisplay.from_estimator(logreg, X_test_pca, y_test, normalize='true', ax=chart_1)
chart_1.set_title('Confusion Matrix')
RocCurveDisplay.from_estimator(logreg, X_test_pca, y_test, ax=chart_2)
chart_2.set_title('ROC Curve')
plt.show()
PCA (unsupervised) does not use the target variable. It attempts to maximize the variance in the feature set. LDA (supervised) can use our target information and maximize the class separability.
Note that LDA will produce at most N-1 components, where N is the number of classes in our target. So in our case, with a binary target, it would produce 1 component.
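That component limit is easy to verify on a small synthetic example (the names here are illustrative): with two classes, `fit_transform` returns a single column.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
X_demo = rng.randn(100, 11)       # 11 features, like our data set
y_demo = rng.randint(0, 2, 100)   # binary target -> at most N-1 = 1 component

lda_demo = LinearDiscriminantAnalysis()
X_lda_demo = lda_demo.fit_transform(X_demo, y_demo)
print(X_lda_demo.shape)  # (100, 1)
```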
We can use the below to generate an LDA feature space and train a logistic regression model against it.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler
lda = LinearDiscriminantAnalysis()
# sc = StandardScaler()
# X_train_std = sc.fit_transform(X_train)
# X_test_std = sc.transform(X_test)
X_train_lda = lda.fit_transform(X_train, y_train)
X_test_lda = lda.transform(X_test)
# fitting the logistic regression model on the lda dataset:
logreg = LogisticRegression()
logreg.fit(X_train_lda, y_train)
y_pred = logreg.predict(X_test_lda)
print('Accuracy of the log reg model on the test data: {:.2f}'.format(logreg.score(X_test_lda, y_test)))
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
# plot_confusion_matrix/plot_roc_curve were removed in scikit-learn 1.2;
# the Display classes are the current equivalents
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay
fig = plt.figure(1, figsize=(20, 5))
chart_1 = fig.add_subplot(121)
chart_2 = fig.add_subplot(122)
ConfusionMatrixDisplay.from_estimator(logreg, X_test_lda, y_test, normalize='true', ax=chart_1)
chart_1.set_title('Confusion Matrix')
RocCurveDisplay.from_estimator(logreg, X_test_lda, y_test, ax=chart_2)
chart_2.set_title('ROC Curve')
plt.show()
Above we can see that our accuracy increased to .73 using LDA, vs. .71 with PCA. Additionally, the AUC value is higher, and perhaps more importantly, the Recall/Sensitivity value increased to .68. I think this helps demonstrate that LDA did a better job, since it was able to use our target information.
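As noted at the outset, a 10-fold CV would make this comparison more robust than a single train/test split. A hedged sketch on synthetic data (`make_classification` stands in for `df_clean`, which is not reproduced here); wrapping scaling and reduction in a pipeline ensures each fold is fit independently, avoiding leakage:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_syn, y_syn = make_classification(n_samples=500, n_features=11, random_state=0)

for name, reducer in [('PCA', PCA(n_components=2)), ('LDA', LinearDiscriminantAnalysis())]:
    # scaler and reducer are refit inside each CV fold
    pipe = make_pipeline(StandardScaler(), reducer, LogisticRegression())
    scores = cross_val_score(pipe, X_syn, y_syn, cv=10, scoring='recall')
    print(f'{name}: mean recall = {scores.mean():.3f}')
```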